
    An Efficient OpenMP Runtime System for Hierarchical Architectures

    Exploiting the full computational power of today's increasingly deep hierarchical multiprocessor machines requires a very careful distribution of threads and data across the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize remote memory accesses, favor cache affinities, and guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we can easily translate the affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. Preliminary performance evaluations show a significant speedup improvement on a typical NAS OpenMP benchmark application.

    EASYPAP: a Framework for Learning Parallel Programming

    This paper presents EASYPAP, an easy-to-use programming environment designed to help students learn parallel programming. EASYPAP features a wide range of 2D computation kernels that students are invited to parallelize using Pthreads, OpenMP, OpenCL or MPI. The execution of kernels can be visualized interactively, and powerful monitoring tools let students observe both the scheduling of computations and the assignment of 2D tiles to threads/processes. By focusing on algorithms and data distribution, students can experiment with diverse code variants and tune multiple parameters, resulting in richer problem exploration and faster progress towards efficient solutions. We present selected lab assignments that illustrate how EASYPAP improves the way students explore parallel programming.

    An Efficient Multi-level Trace Toolkit for Multi-threaded Applications

    Nowadays, observing and understanding the behavior and performance of a multithreaded application is nontrivial, especially within a complex multithreaded environment such as a multilevel thread scheduler. In this report, we present a trace toolkit that allows a programmer to precisely analyze the behavior of a multithreaded application. An application's run generates several traces that are merged and analyzed offline. The resulting super-trace contains not only classical information, such as the number of elapsed CPU cycles per function, but also details about thread scheduling at multiple levels.

    BubbleSched, a platform for designing thread schedulers for hierarchical machines

    Exploiting the full computational power of hierarchical multiprocessor machines with irregular multithreaded applications requires a very careful distribution of threads and data. To reach most of the available performance, programmers often have to give up portability and hardcode ad hoc placement strategies that depend heavily on the architecture. To guarantee performance portability, we have defined abstractions called ``bubbles'' that capture both the hierarchical structure of the application's parallelism and the hierarchical architecture of the target machine. We have defined a set of high-level primitives to ease the implementation of dedicated, efficient and portable schedulers. We show the relevance of our approach and describe the mechanisms we developed to implement such schedulers easily.

    Resource aggregation for task-based Cholesky Factorization on top of modern architectures

    Submitted for review to the Parallel Computing special issue for the HCW and HeteroPar'16 workshops. Hybrid computing platforms are now commonplace, featuring a large number of CPU cores and accelerators. This trend makes balancing computations between these heterogeneous resources performance-critical. In this paper we propose aggregating several CPU cores in order to execute larger parallel tasks and thereby improve load balancing between CPUs and accelerators. Additionally, we present our approach to exploiting internal parallelism within tasks by combining two runtime system schedulers: a global runtime system to schedule the main task graph and a local one to cope with internal task parallelism. We demonstrate the relevance of our approach in the context of the dense Cholesky factorization kernel implemented on top of the StarPU task-based runtime system. We present experimental results showing that our solution outperforms state-of-the-art implementations on two architectures: a modern heterogeneous machine and the Intel Xeon Phi Knights Landing.

    Dynamic Task and Data Placement over NUMA Architectures: an OpenMP Runtime Perspective

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid memory access penalties. Directive-based programming languages such as OpenMP provide programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into ``scheduling hints'' to solve thread/memory affinity issues. It enables dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. First experiments show that mixed solutions (migrating both threads and data) outperform Next-touch-based data distribution policies and open up possibilities for new optimizations.

    ForestGOMP: an efficient OpenMP environment for NUMA architectures

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread/memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing-based balancing strategies and Next-touch-based data distribution policies, and suggest additional optimizations.

    Structuring the execution of OpenMP applications for multicore architectures

    The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers, as a trade-off between the user-friendliness of shared memory on the one hand, and memory access scalability and efficiency on the other. However, getting high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application's behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads on multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs, and collects relationship information about threads and data as well. This information is used, together with scheduling hints and hardware-counter feedback, by the scheduler to select the most appropriate thread and data distribution. ForestGOMP features a high-level platform for developing and tuning portable thread schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.

    Programming heterogeneous architectures using hierarchical tasks

    Task-based systems have gained popularity thanks to their ability to fully exploit the computing power of complex heterogeneous architectures. A common programming model is the Sequential Task Flow (STF) model, which unfortunately can only handle static task graphs. This potentially leads to submission overhead, and the static task graph is not necessarily well suited for execution on a heterogeneous system. A standard solution is to find a compromise between the granularity that exploits the power of accelerators and the one required for good CPU performance. To address these problems, we propose to extend the STF model provided by the STARPU [4] runtime system with the ability to turn certain tasks into subgraphs during execution. We call these tasks hierarchical tasks. This approach makes it possible to express more dynamic task graphs. By combining this new model with an automatic data manager, the granularity can be adapted dynamically to provide an optimal size to the different targeted computing resources. In this article we show that the hierarchical task model is valid, and we give a first evaluation of its performance using the CHAMELEON [1] dense linear algebra library.

    Programming Heterogeneous Architectures Using Hierarchical Tasks

    Task-based systems have gained popularity because of their promise of exploiting the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting static task graphs only. This leads to potential submission overhead and to a static task graph that is not necessarily adapted for execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the one required by CPU cores to achieve performance. To address these problems, we extend the STF model in the StarPU runtime system to enable tasks to be turned into subgraphs at runtime. We refer to these tasks as hierarchical tasks. This approach allows for a more dynamic task graph. This extended model, combined with an automatic data manager, makes it possible to dynamically adapt the granularity to meet the optimal size of the targeted computing resource. We show that the hierarchical task model is correct, and we provide an early evaluation on shared-memory heterogeneous systems, using the Chameleon dense linear algebra library.